gene product
Mitigating mode collapse in normalizing flows by annealing with an adaptive schedule: Application to parameter estimation
Wang, Yihang, Chi, Chris, Dinner, Aaron R.
Normalizing flows (NFs) provide uncorrelated samples from complex distributions, making them an appealing tool for parameter estimation. However, the practical utility of NFs remains limited by their tendency to collapse to a single mode of a multimodal distribution. In this study, we show that annealing with an adaptive schedule based on the effective sample size (ESS) can mitigate mode collapse. We demonstrate that our approach can converge the marginal likelihood for a biochemical oscillator model fit to time-series data in ten-fold less computation time than a widely used ensemble Markov chain Monte Carlo (MCMC) method. We show that the ESS can also be used to reduce variance by pruning the samples. We expect these developments to be of general use for sampling with NFs and discuss potential opportunities for further improvements.
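As an illustration of the effective sample size (ESS) criterion mentioned in the abstract above, here is a minimal sketch rather than the authors' implementation. It assumes a geometric tempering path in which a step from inverse temperature `beta` to `beta_new` gives sample `i` the incremental log-weight `(beta_new - beta) * log_ratio[i]`. The names `effective_sample_size`, `next_beta`, `log_ratio`, and `target_fraction` are hypothetical and chosen for illustration only.

```python
import math


def effective_sample_size(log_weights):
    """Kish ESS, (sum w)^2 / sum w^2, from unnormalized log-weights."""
    m = max(log_weights)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_weights]
    s1 = sum(w)
    s2 = sum(x * x for x in w)
    return s1 * s1 / s2


def next_beta(log_ratio, beta, target_fraction=0.5, tol=1e-6):
    """Pick the next inverse temperature by bisection (illustrative only).

    log_ratio[i] is assumed to be log(target density) - log(current model
    density) at sample i. The step is the largest one in (beta, 1] whose
    incremental weights keep ESS >= target_fraction * n, up to `tol`.
    """
    n = len(log_ratio)

    def ess_at(b):
        return effective_sample_size([(b - beta) * r for r in log_ratio])

    # If jumping straight to the target keeps enough samples, do so.
    if ess_at(1.0) >= target_fraction * n:
        return 1.0

    # ESS is non-increasing in the step size, so bisection applies.
    lo, hi = beta, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess_at(mid) >= target_fraction * n:
            lo = mid
        else:
            hi = mid
    return lo
```

In this sketch a larger `target_fraction` yields smaller, more cautious annealing steps; the same `effective_sample_size` function could also score samples when pruning, although the pruning rule used in the paper is not specified here.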
- North America > United States > Illinois > Cook County > Chicago (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
GoBERT: Gene Ontology Graph Informed BERT for Universal Gene Function Prediction
Miao, Yuwei, Guo, Yuzhi, Ma, Hehuan, Yan, Jingquan, Jiang, Feng, Liao, Rui, Huang, Junzhou
Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction with sequence, 3D-structure, or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring the Gene Ontology graph and annotations with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task utilizes existing functions as inputs and generalizes function prediction to genes and gene products. Specifically, two pre-training tasks are designed to jointly train GoBERT to capture both explicit and implicit relations among functions. Neighborhood prediction is a self-supervised multi-label classification task that captures the explicit function relations. The specified masking and recovering task helps GoBERT find implicit patterns among functions. The pre-trained GoBERT possesses the ability to predict novel functions for various genes and gene products based on known functional annotations. Extensive experiments, biological case studies, and ablation studies are conducted to demonstrate the superiority of our proposed GoBERT.
- Information Technology > Biomedical Informatics > Translational Bioinformatics (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (0.62)
The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models
Abreu-Vicente, Jorge, Sonntag, Hannah, Eidens, Thomas, Lemberger, Thomas
Introduction: The scientific publishing landscape is expanding rapidly, making it challenging for researchers to stay up to date with the evolution of the literature. Natural Language Processing (NLP) has emerged as a potent approach to automating knowledge extraction from this vast number of publications and preprints. Tasks such as Named-Entity Recognition (NER) and Named-Entity Linking (NEL), in conjunction with context-dependent semantic interpretation, offer promising and complementary approaches to extracting structured information and revealing key concepts. Results: We present the SourceData-NLP dataset produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), their role in the experimental design, and the nature of the experimental method as an additional class. SourceData-NLP contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 papers in molecular and cell biology. We illustrate the dataset's usefulness by assessing BioLinkBERT and PubmedBERT, two transformer-based models, fine-tuned on the SourceData-NLP dataset for NER. We also introduce a novel context-dependent semantic task that infers whether an entity is the target of a controlled intervention or the object of measurement. Conclusions: SourceData-NLP's scale highlights the value of integrating curation into publishing. Models trained with SourceData-NLP will furthermore enable the development of tools able to extract causal hypotheses from the literature and assemble them into knowledge graphs.
- Europe > Bulgaria > Sofia City Province > Sofia (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Oregon > Multnomah County > Portland (0.04)
- Research Report > Experimental Study (0.47)
- Research Report > New Finding (0.46)
Multi-omics Prediction from High-content Cellular Imaging with Deep Learning
Mehrizi, Rahil, Mehrjou, Arash, Alegro, Maryana, Zhao, Yi, Carbone, Benedetta, Fishwick, Carl, Vappiani, Johanna, Bi, Jing, Sanford, Siobhan, Keles, Hakan, Bantscheff, Marcus, Nguyen, Cuong, Schwab, Patrick
High-content cellular imaging, transcriptomics, and proteomics data provide rich and complementary views on the molecular layers of biology that influence cellular states and function. However, the biological determinants through which changes in multi-omics measurements influence cellular morphology have not yet been systematically explored, and the degree to which cell imaging could enable the prediction of multi-omics directly from cell imaging data is therefore currently unclear. Here, we address the question of whether it is possible to predict bulk multi-omics measurements directly from cell images using Image2Omics, a deep learning approach that predicts multi-omics in a cell population directly from high-content images stained with multiplexed fluorescent dyes. We perform an experimental evaluation in gene-edited macrophages derived from human induced pluripotent stem cells (hiPSCs) under multiple stimulation conditions and demonstrate that Image2Omics achieves significantly better performance in predicting transcriptomics and proteomics measurements directly from cell images than predictors based on the mean observed training set abundance. We observed significant predictability of abundances for 5903 (22.43%; 95% CI: 8.77%, 38.88%) and 5819 (22.11%; 95% CI: 10.40%, 38.08%) transcripts out of 26137 in M1- and M2-stimulated macrophages, respectively, and for 1933 (38.77%; 95% CI: 36.94%, 39.85%) and 2055 (41.22%; 95% CI: 39.31%, 42.42%) proteins out of 4986 in M1- and M2-stimulated macrophages, respectively. Our results show that some transcript and protein abundances are predictable from cell imaging and that cell imaging may, in some settings and depending on the mechanisms of interest and desired performance threshold, even be a scalable and resource-efficient substitute for multi-omics measurements.
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > Canada > Alberta > Census Division No. 13 > Athabasca County (0.04)
- Europe > United Kingdom (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Health & Medicine > Therapeutic Area > Oncology (0.67)
- Health & Medicine > Therapeutic Area > Hematology > Stem Cells (0.54)
- Health & Medicine > Therapeutic Area > Pulmonary/Respiratory Diseases (0.46)
Biomedical Knowledge Graph Refinement with Embedding and Logic Rules
Zhao, Sendong, Qin, Bing, Liu, Ting, Wang, Fei
There is a rapidly increasing need for high-quality biomedical knowledge graphs (BioKG) that provide direct and precise biomedical knowledge. In the context of COVID-19, this need is even more pressing. However, most BioKG construction inevitably introduces numerous conflicts and noise deriving from incorrect knowledge descriptions in the literature and defective information extraction techniques. Many studies have demonstrated that reasoning over the knowledge graph is effective in eliminating such conflicts and noise. This paper proposes a method, BioGRER, to improve the BioKG's quality, which combines knowledge graph embedding with logic rules that support and negate triplets in the BioKG. In the proposed model, the BioKG refinement problem is formulated as probability estimation for triplets in the BioKG. We employ the variational EM algorithm to optimize knowledge graph embedding and logic rule inference alternately. In this way, our model combines the strengths of both knowledge graph embedding and logic rules, leading to better results than using either alone. We evaluate our model on a COVID-19 knowledge graph and obtain competitive results.
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > New York > Richmond County > New York City (0.04)
- North America > United States > New York > Queens County > New York City (0.04)